We might think that the effect of one variable depends on the level of another.
In these cases, we can investigate interaction effects.
Binary by Binary
The effect of gender on earnings might be different across professions. In our data, we can look at the effect of gender on earnings across Law and Medicine.
How do we get them?
Just multiply the two variables together
AND include the two variables used to make the product
\(y_i = \beta_0 + \beta_1 x_i+ \beta_2 z_i+ \beta_3 x_iz_i + \epsilon_i\)
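As a minimal simulated sketch in R (data and variable names hypothetical), note that `x*z` in a formula expands to `x + z + x:z`, so both constituent main effects are included automatically:

```r
# Simulated sketch: binary-by-binary interaction (hypothetical data)
set.seed(42)
n <- 500
x <- rbinom(n, 1, 0.5)   # e.g., a gender indicator
z <- rbinom(n, 1, 0.5)   # e.g., a profession indicator
y <- 1 + 0.5*x + 0.3*z + 0.8*x*z + rnorm(n)

# x*z expands to x + z + x:z, so both main effects are included
fit <- lm(y ~ x*z)
coef(fit)  # beta0, beta1 (x), beta2 (z), beta3 (x:z)
```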
Rather than fitting different intercepts for groups:
Assumptions
We can look at voting for black suffrage in Iowa. In your PS, you looked at the relationship between enlistment rates in the Civil War and the change in voting for suffrage between 1857 and 1868 (pre- and post- war). Does the effect of having veterans in a county depend on their military service?
lm_iowa = lm(Suffrage_Diff ~ enlist_pct*mean_combat_days, iowa)
summary(lm_iowa)$coefficients
##                                Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept)                  0.63811648 0.170041550  3.752709 0.0003426834
## enlist_pct                  -0.57666230 0.518725622 -1.111690 0.2698222442
## mean_combat_days            -0.01041461 0.005167198 -2.015524 0.0474328834
## enlist_pct:mean_combat_days  0.03141232 0.015877741  1.978387 0.0515576856
We can center both variables at \(0\) to make it easier to interpret:
iowa$enlist_pct_c = iowa$enlist_pct - mean(iowa$enlist_pct, na.rm = T)
iowa$mean_combat_days_c = iowa$mean_combat_days - mean(iowa$mean_combat_days, na.rm = T)
lm_iowa_c = lm(Suffrage_Diff ~ enlist_pct_c*mean_combat_days_c, iowa)
summary(lm_iowa_c)$coefficients
##                                     Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept)                     0.4493171668 0.009831222 45.7030844 1.639784e-56
## enlist_pct_c                    0.5244885119 0.124801151  4.2025936 7.203651e-05
## mean_combat_days_c              0.0001431642 0.001283075  0.1115789 9.114554e-01
## enlist_pct_c:mean_combat_days_c 0.0314123210 0.015877741  1.9783873 5.155769e-02
Now main effects are interpretable.
For continuous interactions, we need to calculate the marginal effect
marginal effect: unit effect of \(x\) on \(y\) at a given value of \(z\).
\(y_i = \beta_0 + \beta_1 x_i+ \beta_2 z_i+ \beta_3 x_iz_i + \epsilon_i\)
Taking the derivative with respect to \(x\), the marginal effect is \(\partial y_i / \partial x_i = \beta_1 + \beta_3 z_i\). Marginal effects have their own standard errors.
\(Var(aX + bZ) = a^2 Var(X) + b^2 Var(Z) + 2ab Cov(X,Z)\)
\(Var(\beta_1 + z_i \beta_3) = Var(\beta_1) + z_i^2 Var(\beta_3) + 2z_i Cov(\beta_1,\beta_3)\)
combat_seq = seq(min(iowa$mean_combat_days),
                 max(iowa$mean_combat_days),
                 length.out = 100)
mfx = lm_iowa$coefficients['enlist_pct'] +
  lm_iowa$coefficients['enlist_pct:mean_combat_days']*combat_seq

Standard Errors:

cov_mat = vcov(lm_iowa)
m_var = diag(cov_mat)['enlist_pct'] +
  combat_seq^2*diag(cov_mat)['enlist_pct:mean_combat_days'] +
  combat_seq*2*cov_mat['enlist_pct','enlist_pct:mean_combat_days']
mse = sqrt(m_var)
l = mfx - 1.96*mse
u = mfx + 1.96*mse

Continuous-continuous and continuous-binary interactions require big assumptions about:
Hainmueller et al. show these problems are very common:
The interflex package lets us model interactions more flexibly.

require(interflex)
## Loading required package: interflex
a = inter.raw(Y = 'Suffrage_Diff',
              D = 'enlist_pct_c',
              X = 'mean_combat_days_c',
              data = iowa,
              theme.bw = T)
b = inter.binning(Y = 'Suffrage_Diff',
                  D = 'enlist_pct_c',
                  X = 'mean_combat_days_c',
                  data = iowa,
                  theme.bw = T,
                  na.rm = T)

Solutions to omitted variable bias
adjustment: using statistical tools like regression to remove biases, making parametric assumptions (linearity, no extrapolation, etc.) and assumptions about ignorability (all relevant variables controlled).
design-based: choosing cases for comparison that remove sources of confounding in the nature of the comparison.
In 1956, the state of Connecticut responded to high rates of automobile fatalities by imposing harsher penalties for speeding.
To evaluate the efficacy of this “crackdown” on speeding, researchers compared automobile deaths before and after the policy change.
What kinds of omitted variables does this comparison address?
What kinds of omitted variables does this NOT address?
Which problems does this figure address?
Design that formalizes previous image:
Estimate (linear) interrupted time series:
\(Y_i = \beta_0 + \beta_1 event_i + \beta_2 time_i + \beta_3 event_i \times time_i + \epsilon_i\)
Where \(event_i\) is \(0\) prior to the event and \(1\) after; \(time_i\) is the number of days (or whatever unit of time) since the event (e.g., \(0\) on the day, \(-1\) the day before, \(1\) the day after).
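A minimal simulated sketch of this specification (data and names hypothetical):

```r
# Simulated interrupted time series (hypothetical data)
set.seed(1)
time  <- -50:50                  # days relative to the event
event <- as.numeric(time >= 0)   # 0 before, 1 on/after the event
y     <- 10 + 2*event + 0.1*time + 0.05*event*time + rnorm(101)

its <- lm(y ~ event*time)
coef(its)  # beta0, beta1 (level shift), beta2 (pre-trend), beta3 (slope change)
```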
Can top-down change in police oversight reduce bias in policing?
What is the point of this?
What is the point of this?
What is the point of this?
How does Mummolo address each of these concerns?
Central limitations:
Extend the interrupted time-series:
If parallel trends assumption holds, what kinds of confounding does this design eliminate?
Implementation:
Example: Card and Krueger (2000)
Do increases in the minimum wage increase unemployment in fast food?
Two ways to use regression:
\[Y_{it} = \beta_0 + \beta_1 treat_i + \beta_2 post_{t} + \beta_3 treat_i \times post_{t} + \epsilon_{it}\]
Where \(treat_i\) is an indicator for being a unit that is ever treated. \(post_t\) is an indicator for the observation being after the treatment takes place (1 if yes, 0 if no).
\[Y_{post} - Y_{pre} = \beta_0 + \beta_1 treat_i + \epsilon_i\]
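With two periods and a balanced panel, the two regressions above give the same difference-in-differences estimate. A simulated sketch (data and names hypothetical):

```r
# Simulated 2x2 difference-in-differences (hypothetical data)
set.seed(2)
n     <- 200
treat <- rep(rbinom(n/2, 1, 0.5), each = 2)   # unit ever treated?
post  <- rep(c(0, 1), n/2)                    # pre/post indicator
y     <- 1 + 0.5*treat + 0.3*post + 1.2*treat*post + rnorm(n)

# Way 1: interaction of treat and post; beta3 is the DiD estimate
did1 <- lm(y ~ treat*post)

# Way 2: regress the pre/post difference on treat; its slope is the DiD estimate
ydiff <- y[post == 1] - y[post == 0]
did2  <- lm(ydiff ~ treat[post == 1])

# The two estimates are numerically identical
all.equal(unname(coef(did1)['treat:post']), unname(coef(did2)[2]))
```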
How do we validate the parallel trends assumption?
Placebo Tests:
If we have multiple time-periods, multiple cases, we get generalized difference-in-differences:
\[Y_{it} = \alpha_i + \alpha_t + \beta X_{it} + \epsilon_{it}\]
This model is equivalent to the following:
\[Y_{it} - \overline{Y_i} = \beta (X_{it} - \overline{X_i}) + (\epsilon_{it} - \overline{\epsilon_i})\]
Or taking the difference of \(X\) and \(Y\) of each case from its mean. Any variables \(Z_i\) that do not vary over time are removed (and thus cannot confound).
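The equivalence between dummy variables and demeaning can be verified with simulated data (names hypothetical; unit fixed effects only, for simplicity):

```r
# Demonstration: unit dummies (fixed effects) = within (demeaned) regression
set.seed(3)
id <- rep(1:5, each = 20)
a  <- rep(rnorm(5, sd = 3), each = 20)   # unit-specific intercepts
x  <- a + rnorm(100)                     # x correlated with the unit effects
y  <- a - 1*x + rnorm(100)

# Dummy-variable estimator
fe_dummies <- lm(y ~ x + factor(id))

# Within estimator: demean x and y by unit
xd <- x - ave(x, id)
yd <- y - ave(y, id)
fe_within <- lm(yd ~ xd)

# Same slope on x either way
all.equal(unname(coef(fe_dummies)['x']), unname(coef(fe_within)['xd']))
```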
Example:
summary(lm(y ~ x, df))$coefficients
##             Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 2.293580 0.70374291 3.259116 0.004357004
## x           0.209703 0.05874732 3.569576 0.002190512
Example:
In this case, the overall relationship between \(X\) and \(Y\) is positive. But within each unit (\(a \dots e\)), the relationship is negative!
Unobserved factors might make each unit have higher levels of x and higher levels of y, but as x increases within a unit (where these unobserved factors are constant), y decreases.
fixed effects allow us to extract this within unit relationship between \(X\) and \(Y\)
Contrast to pooled effects where we compare all observations to each other, not accounting for any unit-specific effects
How to use fixed effects
We add dummy variables for each unit
What does this do?
Fixed effects results:
summary(lm(y ~ x + g, df))$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  5.061294 0.18447238  27.43660 1.428493e-13
## x           -1.036417 0.05217667 -19.86362 1.179835e-11
## gb           5.077768 0.27854719  18.22947 3.763229e-11
## gc          10.321080 0.45635947  22.61612 2.019308e-12
## gd          15.498153 0.65272991  23.74359 1.038617e-12
## ge          20.685745 0.85496530  24.19484 8.026786e-13
What was the effect of enlistment in the US Civil War on voting for the Republican party?
\[GOP_{ie} = \alpha_i + \alpha_e + \beta\, Enlist_i \times PostWar_e + \epsilon_{ie}\]
A placebo test/relaxed assumptions
\[GOP_{ie} = \alpha_{i} + \alpha_{e} + \sum_{y = 1854}^{1920} \beta_y EnlistmentRate_i \times Year_y + \epsilon_{ie}\]
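A sketch of this event-study specification with simulated data (all names hypothetical; this is not the course's county-level data). Because the enlistment rate is constant within county, its main effect is absorbed by the county fixed effects, and one year-interaction is collinear and dropped as the baseline:

```r
# Event-study sketch (hypothetical data): year-specific coefficients on
# enlistment rate, with county and year fixed effects
set.seed(4)
counties <- 40
years    <- seq(1854, 1880, by = 2)
df <- expand.grid(county = 1:counties, year = years)
df$enlist_pct <- rep(runif(counties), times = length(years))
df$gop <- rnorm(nrow(df)) +
  0.5 * df$enlist_pct * (df$year >= 1866)   # true effect only post-war

# One beta_y per year; R returns NA for one collinear interaction,
# so the remaining coefficients are relative to that baseline year
es <- lm(gop ~ factor(county) + factor(year) + enlist_pct:factor(year),
         data = df)
# Pre-war interaction coefficients near zero serve as placebo checks
```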
Key Assumptions:
Caveats: